| Variable Name | Explanation |
|---|---|
| ID | Number corresponding to the precise combination of the features of the model |
| Model Year | Year of the model of the car |
| Make | Brand of the car |
| Model | The model of the car |
| Estimated Annual Petroleum Consumption (Barrels) | Estimated annual petroleum consumption, in barrels |
| Fuel Type 1 | First fuel energy source; the only source if the car is not a hybrid |
| City MPG (Fuel Type 1) | Consumption of the car in miles per gallon of fuel when driving in a city, for fuel type 1 |
| Highway MPG (Fuel Type 1) | Consumption of the car in miles per gallon of fuel when driving on a highway, for fuel type 1 |
| Combined MPG (Fuel Type 1) | Combined city and highway car consumption in miles per gallon, for fuel type 1 |
| Fuel Type 2 | Second energy source, for hybrid cars |
| City MPG (Fuel Type 2) | Consumption of the car in miles per gallon of fuel when driving in a city, for fuel type 2 |
| Highway MPG (Fuel Type 2) | Consumption of the car in miles per gallon of fuel when driving on a highway, for fuel type 2 |
| Combined MPG (Fuel Type 2) | Combined city and highway car consumption in miles per gallon, for fuel type 2 |
| Engine Cylinders | Number of cylinders of the car |
| Engine Displacement | Measure of the cylinder volume swept by all of the pistons of a piston engine, excluding the combustion chambers |
| Drive | Type of powertrain distribution system that delivers rotational propulsion, such as rear-wheel drive, 4-wheel drive, ... |
| Engine Description | Description of some features of the car, such as turbo engine, Stop-Start system, ... |
| Transmission | Manual/Automatic transmission, with number of gears and/or model of transmission |
| Vehicle Class | Type of vehicle, such as Minivan, Trucks,... |
| Time to Charge EV (hours at 120v) | Number of hours required to fully charge an EV car at 120v |
| Time to Charge EV (hours at 240v) | Number of hours required to fully charge an EV car at 240v |
| Range (for EV) | Maximum number of miles possible with a fully charged EV car |
| City Range (for EV - Fuel Type 1) | Maximum number of miles possible with a fully charged EV car in a city |
| City Range (for EV - Fuel Type 2) | Maximum number of miles possible while only using electricity with a fully charged hybrid car in a city |
| Hwy Range (for EV - Fuel Type 1) | Maximum number of miles possible with a fully charged EV car on a highway |
| Hwy Range (for EV - Fuel Type 2) | Maximum number of miles possible while only using electricity with a fully charged hybrid car on a highway |
1 Introduction
1.0.1 The context and background: course, company name, business context.
During our first year of the Master in Management, orientation Business Analytics, we had the opportunity to attend lectures in Machine Learning for Business Analytics. In this class, we covered multiple machine learning techniques for business contexts, mainly supervised methods (regressions, trees, support vector machines, neural networks) and unsupervised methods (clustering, PCA, FAMD, auto-encoders), but also other topics such as data splitting, ensemble methods and evaluation metrics.
1.0.2 Aim of the investigation: major terms should be defined, the question of research (more generally the issue), why it is of interest and relevant in that context.
In the context of this class, our group had the opportunity to work on an applied project. Starting from scratch, we had to look for a dataset on which to apply, in a real case, what we had learned in class. We found an interesting dataset concerning vehicle MPG, range, engine stats and more, for more than 100 brands. The goal of our research was to predict the make (i.e. the brand) of a car from its characteristics (consumption, range, fuel type, ...) using a model that we trained (random forests, artificial neural networks or decision trees). As some cars share several identical characteristics but differ on various others, we thought it would be pertinent to build a model able to predict a car's brand from its features.
1.0.3 Description of the data and the general material provided and how it was made available (and/or collected, if it is relevant). Only in broad terms however, the data will be further described in a following section. Typically, the origin/source of the data (the company, webpage, etc.), the type of files (Excel files, etc.), and what it contains in broad terms (e.g. “a file containing weekly sales with the factors of interest including in particular the promotion characteristics”).
The csv dataset was found on data.world, a data catalog platform that gathers various open-access datasets online. The file contains more than 45,000 rows and 26 columns; each column corresponds to one feature (such as the model year, the model, the estimated annual consumption in barrels, the highway MPG per fuel type, and so on).
1.0.4 The method that is used, in broad terms, no details needed at this point. E.g. “Model based machine learning will help us quantifying the important factors on the sales”.
Among these columns, we had to find a machine learning model that could help us quantify the importance of the features in predicting the make of the car. Various models will be tried, for both supervised and unsupervised learning.
1.0.5 An outlook: a short paragraph indicating from now what will be treated in each following sections/chapters. E.g. “in Section 3, we describe the data. Section 4 is dedicated to the presentation of the text mining methods…”
From now on, we will go through the different sections. Section 2 is dedicated to the data description in more depth, covering the variables and features, the instances, the types of data, and some missing-data patterns. The next section covers the Exploratory Data Analysis (EDA), where visualisations help us perceive patterns in the variables as well as potential correlations. Section 4 then presents the methods, first supervised and then unsupervised, in order to find a suitable model for our project. The results are discussed right after, followed by our conclusion, recommendations and discussion. Finally, the references and appendix can be found at the end of the report.
2 Data description
For this project, we selected a dataset focused on vehicle characteristics, available as a .csv file from data.world. It includes a total of 26 features describing 45,896 vehicle models from 141 brands, released between 1984 and 2023. Below is a table providing an overview of the available features and their descriptions; a deeper description of the data can be found in the Annex. The analysis of the missing values is dealt with in the data cleaning section.
2.0.1 Description of the features
3 EDA
Now that we have presented the variables contained in the dataset, let’s try to understand the data structure, characteristics and underlying patterns thanks to an EDA.
3.0.1 Dataset overview
3.0.2 Columns description
Let’s have a quick look at the characteristics of the columns. You will find more statistical details about them in the annex.
Show the code
# Load necessary packages
library(dplyr)
library(knitr)
library(kableExtra)
# Assuming your dataset is named 'data'
# Get the number of rows and columns
num_rows <- nrow(data)
num_cols <- ncol(data)
# Get the frequency of column types
col_types <- sapply(data, class)
type_freq <- table(col_types)
char_count <- ifelse("character" %in% names(type_freq), type_freq["character"], 0)
num_count <- ifelse("numeric" %in% names(type_freq), type_freq["numeric"], 0)
# Create a summary data frame
summary_df <- data.frame(
Name = "data",
Number_of_rows = num_rows,
Number_of_columns = num_cols,
Character = char_count,
Numeric = num_count,
Group_variables = "None"
)
# Display the summary using kable
kable(summary_df, format = "html", table.attr = "style='width:50%;'") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

| Name | Number_of_rows | Number_of_columns | Character | Numeric | Group_variables |
|---|---|---|---|---|---|
| data | 45896 | 26 | 8 | 18 | None |
The dataset that we are working with contains approximately 46,000 rows and 26 columns, each row representing a model from one of the 141 brands. From the data overview, we can see that most of our features concern the consumption of the cars. Looking in more detail in the annex, we notice that some variables contain many missing values and that the variable “Time.to.Charge.EV..hours.at.120v.” contains only zeros. We will handle these issues in the “Data cleaning” section.
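An all-zero column such as “Time.to.Charge.EV..hours.at.120v.” can be flagged automatically as a zero-variance column. A minimal sketch, using a toy data frame in place of the real dataset:

```r
# Toy data frame standing in for the real dataset;
# charge_120v mimics the all-zero Time.to.Charge.EV..hours.at.120v. column
toy <- data.frame(
  charge_120v = c(0, 0, 0, 0),
  city_mpg    = c(18, 24, 31, 22)
)

# A column with a single unique value carries no information
constant_cols <- names(toy)[vapply(toy, function(x) length(unique(x)) == 1, logical(1))]
constant_cols  # "charge_120v"
```

Such columns can then simply be dropped before modeling, since they cannot help discriminate between makes.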
3.0.3 Exploration of the distribution
Now let’s explore the distribution of the numerical features.
Show the code
# Alternative density plots (kept for reference):
# melt.data <- melt(data)
#
# ggplot(data = melt.data, aes(x = value)) +
#   stat_density() +
#   facet_wrap(~variable, scales = "free")

plot_histogram(data) # from the DataExplorer package
# Time.to.Charge.EV..hours.at.120v. does not appear because all observations = 0

As the majority of models in our dataset are neither electric vehicles (EVs) nor hybrid cars, and because some columns concern only these two types of vehicles, the plots show numerous zero values in several columns. This issue will be addressed during the data cleaning process. Additionally, certain features, such as “Engine Cylinders”, are numerically discrete, as illustrated in the corresponding plot.
3.0.4 Outliers Detection
In order to identify occurrences that deviate significantly from the rest of the observations, and to potentially improve the overall quality of the data, we have decided to analyse outliers using boxplots. Here are the results for the numerical columns of the dataset:
Show the code
#tentative boxplots
data_long <- data %>%
select_if(is.numeric) %>%
pivot_longer(cols = c("ID",
"model_year",
"estimated_Annual_Petrolum_Consumption_Barrels", "City_MPG_Fuel_Type_1",
"highway_mpg_fuel_type_1",
"combined_MPG_Fuel_Type_1",
"City_MPG_Fuel_Type_2",
"highway_mpg_fuel_type_2",
"combined_MPG_Fuel_Type_2",
"time_to_Charge_EV_hours_at_120v_",
"charge_time_240v",
"range_for_EV",
"range_ev_city_fuel_type_1",
"range_ev_city_fuel_type_2",
"range_ev_highway_fuel_type_1",
"range_ev_highway_fuel_type_2"), names_to = "variable", values_to = "value")
ggplot(data_long, aes(x = variable, y = value, fill = variable)) +
geom_boxplot(outlier.size = 0.5) + # Make outlier points smaller
facet_wrap(~ variable, scales = "free_y") + # Each variable gets its own y-axis
theme_minimal() +
theme(legend.position = "none", # Hide the legend
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 0),strip.text = element_text(size = 7)) + # Rotate x-axis labels
labs(title = "Outlier Detection Boxplots", x = "", y = "Values")

Most of our boxplots show extreme values. Again, this is due to the small number of EV and hybrid cars in our dataset compared to the rest of the models, and to the nature of some features that concern only those types of vehicles.
3.0.5 Number of models per make
Now let’s check how many models per make we have in our dataset. In order to have a clear plot, we have decided to keep only the top 20 brands on the graph. All remaining makes are accessible in the table just below.
Show the code
#Number of occurences/model per make
nb_model_per_make <- data %>%
group_by(make, model) %>%
summarise(Number = n(), .groups = 'drop') %>%
group_by(make) %>%
summarise(Models_Per_Make = n(), .groups = 'drop') %>%
arrange(desc(Models_Per_Make))
#makes with only 1 model
only_one_model_cars <- nb_model_per_make %>%
filter(Models_Per_Make == 1 )
#table globale
datatable(nb_model_per_make,
rownames = FALSE,
options = list(pageLength = 5,
class = "hover",
searchHighlight = TRUE))
Show the code
# Option to limit to top 20 makes for better readability
top_n_makes <- nb_model_per_make %>% top_n(20, Models_Per_Make)
#plot
ggplot(top_n_makes, aes(x = reorder(make, Models_Per_Make), y = Models_Per_Make)) +
geom_bar(stat = "identity", color = "black", fill = "grey", show.legend = FALSE) +
labs(title = "Models per Make (Top 20)",
x = "Make",
y = "Number of Models") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
axis.text.y = element_text(hjust = 1, size = 10),
plot.title = element_text(size = 14)) +
coord_flip() # Flip coordinates for better readability

Of the 141 brands, we notice that only 13 have more than 100 models in the dataset. Among these, only two (Mercedes-Benz and BMW) have more than 400 models present. Let’s now check the number of makes with a single model in the dataset.
Show the code
#table only 1 model
datatable(only_one_model_cars,
rownames = FALSE,
options = list(pageLength = 5,
class = "hover",
searchHighlight = TRUE))
Show the code
ggplot(only_one_model_cars, aes(x = make, y = Models_Per_Make)) +
geom_bar(stat = "identity", color = "black", fill = "grey", show.legend = FALSE) +
labs(title = "Makes with only 1 model in the dataset",
x = "Make",
y = "Number of Models") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
axis.text.y = element_text(hjust = 1, size = 10),
plot.title = element_text(size = 14)) +
coord_flip() # Flip coordinates for better readability

We can see that Mercedes-Benz and BMW have significantly more models in our dataset, which means that we are dealing with imbalanced classes. We therefore need to be careful when making predictions, as the models may be biased toward these two majority classes. A few techniques can be used to deal with this problem, such as resampling, ensemble methods (Random Forest, boosting), and tuning the probability threshold.
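As one concrete example of the resampling option, the majority classes can be downsampled so that every make contributes the same number of rows to the training set. A minimal sketch on toy data (the column names and counts are illustrative, not those of the real dataset):

```r
library(dplyr)

set.seed(42)
# Toy imbalanced data: one majority make and one minority make
toy <- data.frame(
  make = c(rep("BMW", 400), rep("Fiat", 20)),
  mpg  = rnorm(420, mean = 25, sd = 5),
  stringsAsFactors = FALSE
)

# Downsample every make to the size of the smallest class
n_min <- min(table(toy$make))
balanced <- toy %>%
  group_by(make) %>%
  slice_sample(n = n_min) %>%
  ungroup()

table(balanced$make)  # 20 rows per make
```

Upsampling the minority classes (sampling with replacement) works symmetrically; which direction is preferable depends on how much data can be sacrificed.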
4 Data cleaning
In this section, we handle the missing values in our dataset to make sure that we have a clean dataset for our modeling. We first visualize the missing values, then clean them in the columns that we will use for our analysis. We also remove some rows and columns that are not relevant for our analysis.
Let’s have a look at the entire dataset and its missing values in grey.
Overall, we do not have many missing values in proportion to the size of our dataset. However, some columns have a lot of missing values. Below is the detail of the percentage of missing values by column.
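The per-column percentage of missing values can be computed in one line. A minimal sketch on a toy data frame (standing in for the real dataset):

```r
# Toy data frame with some NAs
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c("x", "y", NA, "z"),
  stringsAsFactors = FALSE
)

# is.na() returns a logical matrix; colMeans() gives the NA share per column
pct_missing <- round(100 * colMeans(is.na(toy)), 1)
pct_missing  # a = 50, b = 25
```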
4.1 Dealing with the columns Engine Cylinders and Engine Displacement
As we can see, six columns contain missing values. Let’s first have a closer look at the engine cylinders and engine displacement columns. They both have 484 missing values. After some data manipulation, we see that these 484 missing values all correspond to electric vehicles. Given that our dataset contains exactly 484 electric vehicles, we know that the missing values in these columns concern only electric vehicles. This makes sense, since electric vehicles do not have a combustion engine and those categories are therefore not applicable. We will therefore replace all missing values in these two columns with “none”.
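This replacement can be sketched as follows on toy data (note that writing the string "none" into a numeric column coerces it to character, which is acceptable here since the report treats these columns categorically afterwards):

```r
# Toy data: NAs in engine_cylinders occur only for electric vehicles
toy <- data.frame(
  fuel_type_1      = c("Regular", "Electricity", "Electricity"),
  engine_cylinders = c(4, NA, NA),
  stringsAsFactors = FALSE
)

# Replace NAs with "none"; the column becomes character as a side effect
toy$engine_cylinders <- ifelse(
  is.na(toy$engine_cylinders),
  "none",
  as.character(toy$engine_cylinders)
)

toy$engine_cylinders  # "4" "none" "none"
```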
As we can see, we still have missing values in the columns “Fuel Type 2”, “Engine Description”, “Drive” and “Transmission”. Let’s investigate the missing values in the column “Drive”.
4.2 Dealing with the column Drive, Transmission and Engine Description
We decided to drop the brands with more than 10% missing values in the “Drive” column. After this operation, we also removed the 8 remaining observations with missing values in the “Transmission” column. Finally, we dropped the “Engine Description” column, since it contains missing values for more than a third of our observations.
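These two cleaning steps can be sketched with dplyr on toy data (the 10% threshold matches the report; the column and make names are illustrative):

```r
library(dplyr)

# Toy data: make "A" has 30% missing Drive values, make "B" has none;
# engine_description is entirely missing, mimicking a too-sparse column
toy <- data.frame(
  make  = c(rep("A", 10), rep("B", 10)),
  drive = c(rep(NA, 3), rep("FWD", 7), rep("RWD", 10)),
  engine_description = NA_character_,
  stringsAsFactors = FALSE
)

# Identify makes whose share of missing Drive values exceeds 10%
drop_makes <- toy %>%
  group_by(make) %>%
  summarise(pct_na = mean(is.na(drive)), .groups = "drop") %>%
  filter(pct_na > 0.10) %>%
  pull(make)

# Drop those makes, any leftover rows with missing Drive, and the sparse column
cleaned <- toy %>%
  filter(!make %in% drop_makes, !is.na(drive)) %>%
  select(-engine_description)
```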
4.3 Final dataset
The dataset is now clean and does not contain any missing values. It contains 42,240 observations, 18 features and 129 brands. We renamed the columns and stored the result in a csv file (data_cleaned.csv). However, for some models, we need to tackle the imbalanced classes in the target variable. For this reason, we also created a new csv file in which we drop the makes with fewer than 10 models (data_cleaned_reduced.csv). This dataset contains 42,061 observations, 18 features and 66 brands.
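Building the reduced dataset amounts to filtering on a per-make count. A minimal sketch on toy data (here the count is over rows; on the real data one would count distinct models per make):

```r
library(dplyr)

# Toy data: one make with enough models, one rare make
toy <- data.frame(
  make  = c(rep("BMW", 12), rep("Rare", 3)),
  model = paste0("model_", 1:15),
  stringsAsFactors = FALSE
)

# Keep only makes appearing at least 10 times
keep_makes <- toy %>%
  count(make, name = "n_models") %>%
  filter(n_models >= 10) %>%
  pull(make)

reduced <- filter(toy, make %in% keep_makes)
nrow(reduced)  # 12
```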
Here are the two cleaned datasets on which we will be working from now on.
Cleaned Dataset
| Name | Number_of_rows | Number_of_columns | Character | Numeric | Group_variables |
|---|---|---|---|---|---|
| data_cleaned | 42240 | 18 | 8 | 5 | None |
Cleaned and Reduced Dataset